Introduction to R Programming

STA6235: Modeling in Regression

Introduction

  • Today, we will discuss R programming in broad strokes.

    • It is, unfortunately, impossible for me to teach you everything you need to know about R.

    • My goal is to give you the building blocks and build your confidence.

  • Today’s lecture is taken from the following:

  • There are lots of resources out there on R!

    • My process: Google the thing I want to do with either “R” or “tidyverse”

      • e.g., “create new variable tidyverse” or “logistic regression R”
    • Posit cheatsheets (free!) are also helpful.

R functions

  • There are many functions that live in “base R”

  • If we want to use other functions, we need to load the package they’re stored in with library().

# In a fresh session this fails: the pipe %>% is not part of base R
mtcars %>% summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
Error in mtcars %>% summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg)): could not find function "%>%"
# After loading the tidyverse, the same line runs
library(tidyverse)
mtcars %>% summarize(mean_mpg = mean(mpg), sd_mpg = sd(mpg))
  • Note that you only need to load a package once per R session.
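As a small sketch of the difference between base R and package functions (assuming the dplyr package, which is part of the tidyverse, is installed): base functions work in a fresh session, while a package function can also be reached for a single call by prefixing it with the package name and `::` instead of loading the whole package.

```r
# Base R: mean() and sd() are available without loading anything.
base_mean <- mean(mtcars$mpg)
base_sd   <- sd(mtcars$mpg)

# Package function without library(): prefix the call with dplyr::
# (assumes dplyr is installed; no pipe is needed for this form).
pkg_summary <- dplyr::summarize(mtcars,
                                mean_mpg = mean(mpg),
                                sd_mpg   = sd(mpg))

# Both routes give the same numbers.
all.equal(base_mean, pkg_summary$mean_mpg)
all.equal(base_sd, pkg_summary$sd_mpg)
```

The `::` form is handy for one-off calls; for an analysis that uses many tidyverse functions, a single library(tidyverse) at the top is usually less typing.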

Analysis Process

Importing Data

  • I most often get data in .csv or .xlsx format.

    • .csv → read_csv() from readr
    • .xlsx → read_xlsx() from readxl
  • When sharing data with other people, I’ve found Google Sheets easy to work with.

    • Okay dealing with authentication → read_sheet() from googlesheets4
    • Don’t want to deal with authentication → gsheet2tbl() from gsheet
  • Occasionally I get data in a format specific to an analysis program.

    • .sas7bdat → read_sas() from haven
    • .sav → read_spss() from haven
    • .dta → read_dta() from haven
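To make the import step concrete without relying on an outside file, here is a minimal sketch that writes a tiny, made-up data set to a temporary .csv and reads it back with read_csv(). The file, column names, and values are hypothetical and only for illustration; the sketch assumes the readr package is installed.

```r
library(readr)

# Write a tiny, made-up data set to a temporary .csv file
# (hypothetical contents, for illustration only).
tmp <- tempfile(fileext = ".csv")
write_csv(data.frame(id = 1:3, score = c(2.5, 3.0, 4.5)), tmp)

# Read it back; show_col_types = FALSE silences the column-type message.
toy <- read_csv(tmp, show_col_types = FALSE)

head(toy, n = 3)     # peek at the first rows
sapply(toy, class)   # check that the column types were guessed sensibly
```

Whichever import function you use, looking at the first few rows and the column types right after reading is a cheap way to catch problems early.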

Importing Data

  • Google Sheet Example
library(gsheet)
surgery <- gsheet2tbl("https://docs.google.com/spreadsheets/d/1roCT2mtO-k28YV9k7ueLzpCrZ3tN4c08UhZmPzJTZfw/edit?usp=sharing")
head(surgery, n = 5)

Importing Data

  • .csv Example
  • Let’s consider the sauce data from Hot Ones:
sauces <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/sauces.csv")
head(sauces, n = 5)

RStudio vs. Quarto Document

Summary Statistics

# Print the summary without saving it
sauces %>% 
  summarize(mean = mean(scoville),
            sd = sd(scoville),
            median = median(scoville),
            iqr = IQR(scoville))
# Save the summary as an object (nothing is printed)
sauces_summary <- sauces %>% 
  summarize(mean = mean(scoville),
            sd = sd(scoville),
            median = median(scoville),
            iqr = IQR(scoville))
# Wrapping the assignment in parentheses saves and prints in one step
(sauces_summary <- sauces %>% 
  summarize(mean = mean(scoville),
            sd = sd(scoville),
            median = median(scoville),
            iqr = IQR(scoville)))

Summary Statistics

  • We can use the group_by() function to request summary statistics by group(s).
sauces_summary <- sauces %>%
  group_by(sauce_number) %>%
  summarize(mean = mean(scoville),
            sd = sd(scoville),
            median = median(scoville), 
            iqr = IQR(scoville))

Summary Statistics

  • We can group by a different variable, such as season, in the same way.
sauces_summary <- sauces %>%
  group_by(season) %>%
  summarize(mean = mean(scoville),
            sd = sd(scoville),
            median = median(scoville), 
            iqr = IQR(scoville))
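The “group(s)” above can be more than one variable. As a sketch using the built-in mtcars data rather than the Hot Ones data (and assuming dplyr is installed), grouping by two variables returns one row per combination that appears in the data:

```r
library(dplyr)

# mtcars ships with R; cyl = number of cylinders, am = transmission (0/1).
mpg_by_group <- mtcars %>%
  group_by(cyl, am) %>%
  summarize(n = n(),
            mean = mean(mpg),
            sd = sd(mpg),
            .groups = "drop")   # return an ungrouped tibble

mpg_by_group
```

Including n = n() alongside the statistics is a useful habit, since a mean or standard deviation based on only two or three rows deserves less trust.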

Summary Statistics

  • Let’s now look at the episode data for Hot Ones:
episodes <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-08-08/episodes.csv')
head(episodes)

Summary Statistics

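# Count episodes by season and finish status, then convert to proportions
ep_summary <- episodes %>%
  group_by(season, finished) %>%
  summarize(n = n(), .groups = "drop_last") %>%  # result is still grouped by season
  mutate(freq = n / sum(n))                      # so sum(n) is the season total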

head(ep_summary, n = 4)
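It may not be obvious why dividing by sum(n) gives a within-season proportion. After summarizing, dplyr by default drops only the last grouping variable (finished), so the result is still grouped by season and sum(n) is computed per season. A minimal sketch on made-up data (the toy_eps values are hypothetical, and dplyr is assumed to be installed):

```r
library(dplyr)

# Hypothetical toy data: two seasons, whether each guest finished.
toy_eps <- tibble(
  season   = c(1, 1, 1, 1, 2, 2),
  finished = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)

toy_summary <- toy_eps %>%
  group_by(season, finished) %>%
  summarize(n = n(), .groups = "drop_last") %>%  # still grouped by season
  mutate(freq = n / sum(n))                      # n divided by the season total

toy_summary

# Sanity check: the proportions add up to 1 within each season.
toy_summary %>% summarize(total = sum(freq))
```

In season 1 of the toy data, 3 of 4 guests finished, so freq is 0.75 for finished == TRUE and 0.25 for FALSE.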

Graphs

  • Recall the summary statistics by sauce number:
sauces_summary <- sauces %>%
  group_by(sauce_number) %>%
  summarize(mean = mean(scoville),
            sd = sd(scoville),
            median = median(scoville), 
            iqr = IQR(scoville))

Graphs

  • Let’s explore the central tendency of the Scoville heat units of the sauces.
sauces_summary %>% 
  ggplot(aes(x = as.factor(sauce_number), y = mean)) +
    geom_point(size = 2) + 
    labs(x = "Sauce Number",
         y = "Mean Scoville Heat Units") +
    theme_bw()

  • Da Bomb Beyond Insanity is always the 8th sauce …

  • … but it doesn’t appear much hotter than the previous sauces …

sauces %>%
  filter(sauce_number == 8) %>%
  ggplot(aes(y = scoville, x = as.factor(season))) + 
    geom_point() + 
    labs(x = "Season Number",
         y = "Scoville Heat Units") +
    theme_bw()

Graphs

  • Let’s explore the proportion of guests who finish the wing series.
ep_summary %>% 
  filter(finished == TRUE) %>%
  ggplot(aes(x = as.factor(season), y = freq)) +
    geom_point(size = 2) +
    labs(x = "Season",
         y = "Proportion that Finish") +
    theme_bw()

Wrap Up

  • Some reminders:

    • Statistician/biostatistician/data scientist first, programmer second.

    • Yes, I will challenge you on labs and projects.

      • Do not be afraid to Google!

      • Similarly, do not be afraid to ask me to verify what you find.

    • Always remember that I want you to learn, which includes learning how to function within your career.

  • Next week, we will remind ourselves about linear regression and how to construct models in R.